The dataset contains the following features:
Note: You can assume that the data is collected in the year 2016.
!pip install scikit-learn-extra
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# To scale the data using z-score
from sklearn.preprocessing import StandardScaler
# Importing PCA and t-SNE
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# Importing clustering algorithms
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn_extra.cluster import KMedoids
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
# Silhouette score
from sklearn.metrics import silhouette_score
from statistics import mode
import warnings
warnings.filterwarnings("ignore")
data=pd.read_csv("https://github.com/jesusina/clustering_PCA/blob/main/marketing_campaign.csv?raw=true")
data.head()
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 27 columns
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2240 entries, 0 to 2239 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 2240 non-null int64 1 Year_Birth 2240 non-null int64 2 Education 2240 non-null object 3 Marital_Status 2240 non-null object 4 Income 2216 non-null float64 5 Kidhome 2240 non-null int64 6 Teenhome 2240 non-null int64 7 Dt_Customer 2240 non-null object 8 Recency 2240 non-null int64 9 MntWines 2240 non-null int64 10 MntFruits 2240 non-null int64 11 MntMeatProducts 2240 non-null int64 12 MntFishProducts 2240 non-null int64 13 MntSweetProducts 2240 non-null int64 14 MntGoldProds 2240 non-null int64 15 NumDealsPurchases 2240 non-null int64 16 NumWebPurchases 2240 non-null int64 17 NumCatalogPurchases 2240 non-null int64 18 NumStorePurchases 2240 non-null int64 19 NumWebVisitsMonth 2240 non-null int64 20 AcceptedCmp3 2240 non-null int64 21 AcceptedCmp4 2240 non-null int64 22 AcceptedCmp5 2240 non-null int64 23 AcceptedCmp1 2240 non-null int64 24 AcceptedCmp2 2240 non-null int64 25 Complain 2240 non-null int64 26 Response 2240 non-null int64 dtypes: float64(1), int64(23), object(3) memory usage: 472.6+ KB
We have 24 missing values for Income (2216 non-null of 2240 rows). We'll fill them with draws from a normal distribution matching the data, centered on the median rather than the mean to guard against outliers.
meanIncome, stdIncome=data['Income'].median(), data['Income'].std()  # note: despite the name, meanIncome holds the median (robust to outliers)
indices_to_be_updated=data[data['Income'].isnull()].index
data.loc[indices_to_be_updated, 'Income']=np.random.normal(meanIncome, stdIncome, len(indices_to_be_updated))
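One caveat with normal draws is that they can produce negative incomes. A minimal sketch on toy data (hypothetical values) that clips the draws at zero:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded generator for reproducibility
df = pd.DataFrame({"Income": [50000.0, 60000.0, np.nan, 55000.0, np.nan]})

center, spread = df["Income"].median(), df["Income"].std()
missing = df["Income"].isnull()
# draw from N(median, std) and clip at 0 so no imputed income is negative
draws = np.clip(rng.normal(center, spread, missing.sum()), 0, None)
df.loc[missing, "Income"] = draws
```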
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2240 entries, 0 to 2239 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 2240 non-null int64 1 Year_Birth 2240 non-null int64 2 Education 2240 non-null object 3 Marital_Status 2240 non-null object 4 Income 2240 non-null float64 5 Kidhome 2240 non-null int64 6 Teenhome 2240 non-null int64 7 Dt_Customer 2240 non-null object 8 Recency 2240 non-null int64 9 MntWines 2240 non-null int64 10 MntFruits 2240 non-null int64 11 MntMeatProducts 2240 non-null int64 12 MntFishProducts 2240 non-null int64 13 MntSweetProducts 2240 non-null int64 14 MntGoldProds 2240 non-null int64 15 NumDealsPurchases 2240 non-null int64 16 NumWebPurchases 2240 non-null int64 17 NumCatalogPurchases 2240 non-null int64 18 NumStorePurchases 2240 non-null int64 19 NumWebVisitsMonth 2240 non-null int64 20 AcceptedCmp3 2240 non-null int64 21 AcceptedCmp4 2240 non-null int64 22 AcceptedCmp5 2240 non-null int64 23 AcceptedCmp1 2240 non-null int64 24 AcceptedCmp2 2240 non-null int64 25 Complain 2240 non-null int64 26 Response 2240 non-null int64 dtypes: float64(1), int64(23), object(3) memory usage: 472.6+ KB
Let's check for duplicate customer IDs (all 2240 should be unique):
len(data['ID'].unique())
2240
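All 2240 IDs are unique, so there are no duplicates. An equivalent check that reads more directly, sketched on toy IDs:

```python
import pandas as pd

ids = pd.Series([5524, 2174, 4141, 5524])  # toy IDs with one repeat
# duplicated() flags every occurrence after the first
print(ids.duplicated().sum())  # 1
print(ids.nunique())           # 3
```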
data['Marital_Status'].value_counts()
Married 864 Together 580 Single 480 Divorced 232 Widow 77 Alone 3 Absurd 2 YOLO 2 Name: Marital_Status, dtype: int64
data['Education'].value_counts()
Graduation 1127 PhD 486 Master 370 2n Cycle 203 Basic 54 Name: Education, dtype: int64
Attempting to interpret these categories.
data.groupby('Education')['Income'].mean()
Education 2n Cycle 47721.263842 Basic 20306.259259 Graduation 52743.122172 Master 52782.396741 PhD 56072.733886 Name: Income, dtype: float64
sns.boxplot(data, x='Education', y='Income')
<Axes: xlabel='Education', ylabel='Income'>
We have a single outlier in income, more than four times that of any other data point. Who is this person?
data[data['Income']>200000]
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2233 | 9432 | 1977 | Graduation | Together | 666666.0 | 1 | 0 | 02-06-2013 | 23 | 9 | ... | 1 | 3 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 rows × 27 columns
This person was likely not reporting their income seriously. Because this person has otherwise usable data, let's keep them, but change their income to the median for their "Graduation" education level.
data.at[2233,'Income']=data.groupby('Education')['Income'].median()['Graduation']
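The same group-median replacement can be done for many rows at once with `groupby().transform`; a sketch on a toy frame (hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({
    "Education": ["Grad", "Grad", "Grad", "PhD", "PhD"],
    "Income":    [40000.0, 50000.0, 666666.0, 60000.0, 70000.0],
})
# transform broadcasts each group's median back to the group's rows
group_median = df.groupby("Education")["Income"].transform("median")
outlier = df["Income"] > 200000
df.loc[outlier, "Income"] = group_median[outlier]
print(df["Income"].tolist())  # [40000.0, 50000.0, 50000.0, 60000.0, 70000.0]
```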
sns.boxplot(data, x='Education', y='Income', hue='Marital_Status')
<Axes: xlabel='Education', ylabel='Income'>
data['Dt_Customer'].value_counts()
31-08-2012 12
12-09-2012 11
14-02-2013 11
12-05-2014 11
20-08-2013 10
..
05-08-2012 1
18-11-2012 1
25-05-2013 1
14-04-2013 1
09-01-2014 1
Name: Dt_Customer, Length: 663, dtype: int64
From the context of these columns, we are dealing with a grocery store with an online presence, trying to determine how to spend its marketing budget to maximize profit.
There are 2240 unique customers and, after imputation, no missing values in the dataset. Not counting customer ID, we have 26 columns, of which 23 are numerical. Seven of these numerical columns (the five AcceptedCmp flags, Complain, and Response) are binary indicators. The remaining 3 non-numerical columns are a date string (Dt_Customer), education, and marital status. These need to be treated carefully.
Dt_Customer: There are 663 unique dates on which our 2240 customers were acquired.
Education: Based on univariate analysis of income, "Basic" here seems to indicate less than a bachelor's degree. "2n Cycle" is roughly the European equivalent of a master's degree, though it may be worth keeping as a distinct category.
Marital Status: the responses suggest customers were permitted to fill an "Other" category with a custom string. The author of this notebook would advise the data engineers to restrict the choices to Married, Together, Single, Divorced, and Widow: "Alone" should be taken as Single, and the remaining answers, "Absurd" and "YOLO", are obviously not useful. We deal with the treatment of these below.
Converting "Alone" to "Single":
data['Marital_Status']=data['Marital_Status'].apply(lambda x: 'Single' if x=='Alone' else x)
data['Marital_Status'].value_counts()
Married 864 Together 580 Single 483 Divorced 232 Widow 77 Absurd 2 YOLO 2 Name: Marital_Status, dtype: int64
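The apply-with-lambda pattern above works, but `Series.replace` expresses the same recoding more directly:

```python
import pandas as pd

status = pd.Series(["Single", "Alone", "Married", "Alone"])  # toy values
# a dict maps only the listed labels; everything else passes through untouched
status = status.replace({"Alone": "Single"})
print(status.tolist())  # ['Single', 'Single', 'Married', 'Single']
```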
Let's clean up the date column
data['Dt_Customer']=pd.to_datetime(data['Dt_Customer'],dayfirst=True)
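`dayfirst=True` matters here because the strings are DD-MM-YYYY; without it, pandas would read "04-09-2012" as April 9. A quick check:

```python
import pandas as pd

ts = pd.to_datetime("04-09-2012", dayfirst=True)
print(ts.day, ts.month, ts.year)  # 4 9 2012: day 4 of September, not April 9
```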
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2240 entries, 0 to 2239 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 2240 non-null int64 1 Year_Birth 2240 non-null int64 2 Education 2240 non-null object 3 Marital_Status 2240 non-null object 4 Income 2240 non-null float64 5 Kidhome 2240 non-null int64 6 Teenhome 2240 non-null int64 7 Dt_Customer 2240 non-null datetime64[ns] 8 Recency 2240 non-null int64 9 MntWines 2240 non-null int64 10 MntFruits 2240 non-null int64 11 MntMeatProducts 2240 non-null int64 12 MntFishProducts 2240 non-null int64 13 MntSweetProducts 2240 non-null int64 14 MntGoldProds 2240 non-null int64 15 NumDealsPurchases 2240 non-null int64 16 NumWebPurchases 2240 non-null int64 17 NumCatalogPurchases 2240 non-null int64 18 NumStorePurchases 2240 non-null int64 19 NumWebVisitsMonth 2240 non-null int64 20 AcceptedCmp3 2240 non-null int64 21 AcceptedCmp4 2240 non-null int64 22 AcceptedCmp5 2240 non-null int64 23 AcceptedCmp1 2240 non-null int64 24 AcceptedCmp2 2240 non-null int64 25 Complain 2240 non-null int64 26 Response 2240 non-null int64 dtypes: datetime64[ns](1), float64(1), int64(23), object(2) memory usage: 472.6+ KB
data['Year'], data['Month'], data['Day']=data['Dt_Customer'].dt.year, data['Dt_Customer'].dt.month, data['Dt_Customer'].dt.day
data=data.drop(['Dt_Customer'],axis=1)
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2240 entries, 0 to 2239 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 2240 non-null int64 1 Year_Birth 2240 non-null int64 2 Education 2240 non-null object 3 Marital_Status 2240 non-null object 4 Income 2240 non-null float64 5 Kidhome 2240 non-null int64 6 Teenhome 2240 non-null int64 7 Recency 2240 non-null int64 8 MntWines 2240 non-null int64 9 MntFruits 2240 non-null int64 10 MntMeatProducts 2240 non-null int64 11 MntFishProducts 2240 non-null int64 12 MntSweetProducts 2240 non-null int64 13 MntGoldProds 2240 non-null int64 14 NumDealsPurchases 2240 non-null int64 15 NumWebPurchases 2240 non-null int64 16 NumCatalogPurchases 2240 non-null int64 17 NumStorePurchases 2240 non-null int64 18 NumWebVisitsMonth 2240 non-null int64 19 AcceptedCmp3 2240 non-null int64 20 AcceptedCmp4 2240 non-null int64 21 AcceptedCmp5 2240 non-null int64 22 AcceptedCmp1 2240 non-null int64 23 AcceptedCmp2 2240 non-null int64 24 Complain 2240 non-null int64 25 Response 2240 non-null int64 26 Year 2240 non-null int64 27 Month 2240 non-null int64 28 Day 2240 non-null int64 dtypes: float64(1), int64(26), object(2) memory usage: 507.6+ KB
fig = plt.figure(figsize = (18, 6))
sns.heatmap(data.corr(numeric_only=True), annot = True);  # numeric_only avoids errors on the remaining object columns
plt.xticks(rotation = 45);
Month and day of joining have minimal correlations with our other variables, except for an interesting connection between month and year.
import plotly.express as px
px.box(data, x='Year', y='Month', points="all")
The negative correlation is purely an artifact of the time window being considered (July 2012-June 2014). Let's drop month and day.
data=data.drop(['Month', 'Day'],axis=1)
Let's look at our "YOLO" and "Absurd" Marital_Status points. We note that Teenhome is the strongest correlate of birth year.
sns.catplot(data, y='Year_Birth', x='Marital_Status', hue='Kidhome', col='Teenhome')
<seaborn.axisgrid.FacetGrid at 0x7a25148a2230>
Exploratory data analysis doesn't offer much guidance for the "YOLO" and "Absurd" marital statuses, so we assign them to "Married", the most common status. We also note outliers in Year_Birth (birth years before 1920), which we replace with the mean Year_Birth.
data['Marital_Status']=data['Marital_Status'].apply(lambda x: 'Married' if x=='YOLO' or x=="Absurd" else x)
data.loc[data['Year_Birth']<1920, 'Year_Birth']=data['Year_Birth'].mean()  # replace pre-1920 outliers with the mean
sns.catplot(data, y='Year_Birth', x='Marital_Status', hue='Kidhome', col='Education')
<seaborn.axisgrid.FacetGrid at 0x7a2510472140>
Dropping customer ID, which is an arbitrary identifier uncorrelated with our other variables.
data=data.drop(['ID'], axis=1)
data_dums=pd.get_dummies(data, drop_first=True)
data_dums.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Year_Birth | 2240.0 | 1968.901526 | 11.694076 | 1940.0 | 1959.00 | 1970.0 | 1977.00 | 1996.0 |
| Income | 2240.0 | 51960.598085 | 21523.288578 | 1730.0 | 35303.00 | 51381.5 | 68468.25 | 162397.0 |
| Kidhome | 2240.0 | 0.444196 | 0.538398 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| Teenhome | 2240.0 | 0.506250 | 0.544538 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| Recency | 2240.0 | 49.109375 | 28.962453 | 0.0 | 24.00 | 49.0 | 74.00 | 99.0 |
| MntWines | 2240.0 | 303.935714 | 336.597393 | 0.0 | 23.75 | 173.5 | 504.25 | 1493.0 |
| MntFruits | 2240.0 | 26.302232 | 39.773434 | 0.0 | 1.00 | 8.0 | 33.00 | 199.0 |
| MntMeatProducts | 2240.0 | 166.950000 | 225.715373 | 0.0 | 16.00 | 67.0 | 232.00 | 1725.0 |
| MntFishProducts | 2240.0 | 37.525446 | 54.628979 | 0.0 | 3.00 | 12.0 | 50.00 | 259.0 |
| MntSweetProducts | 2240.0 | 27.062946 | 41.280498 | 0.0 | 1.00 | 8.0 | 33.00 | 263.0 |
| MntGoldProds | 2240.0 | 44.021875 | 52.167439 | 0.0 | 9.00 | 24.0 | 56.00 | 362.0 |
| NumDealsPurchases | 2240.0 | 2.325000 | 1.932238 | 0.0 | 1.00 | 2.0 | 3.00 | 15.0 |
| NumWebPurchases | 2240.0 | 4.084821 | 2.778714 | 0.0 | 2.00 | 4.0 | 6.00 | 27.0 |
| NumCatalogPurchases | 2240.0 | 2.662054 | 2.923101 | 0.0 | 0.00 | 2.0 | 4.00 | 28.0 |
| NumStorePurchases | 2240.0 | 5.790179 | 3.250958 | 0.0 | 3.00 | 5.0 | 8.00 | 13.0 |
| NumWebVisitsMonth | 2240.0 | 5.316518 | 2.426645 | 0.0 | 3.00 | 6.0 | 7.00 | 20.0 |
| AcceptedCmp3 | 2240.0 | 0.072768 | 0.259813 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp4 | 2240.0 | 0.074554 | 0.262728 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp5 | 2240.0 | 0.072768 | 0.259813 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp1 | 2240.0 | 0.064286 | 0.245316 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp2 | 2240.0 | 0.012946 | 0.113069 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Complain | 2240.0 | 0.009375 | 0.096391 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Response | 2240.0 | 0.149107 | 0.356274 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Year | 2240.0 | 2013.028125 | 0.684554 | 2012.0 | 2013.00 | 2013.0 | 2013.00 | 2014.0 |
| Education_Basic | 2240.0 | 0.024107 | 0.153416 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Education_Graduation | 2240.0 | 0.503125 | 0.500102 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| Education_Master | 2240.0 | 0.165179 | 0.371425 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Education_PhD | 2240.0 | 0.216964 | 0.412270 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Marital_Status_Married | 2240.0 | 0.387500 | 0.487288 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
| Marital_Status_Single | 2240.0 | 0.215625 | 0.411347 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Marital_Status_Together | 2240.0 | 0.258929 | 0.438144 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
| Marital_Status_Widow | 2240.0 | 0.034375 | 0.182231 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
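Note that `drop_first=True` drops one level per categorical column, leaving k-1 indicators (the dropped level becomes the all-zeros baseline, avoiding the dummy-variable trap). That is why "2n Cycle" and "Divorced" have no columns above. On a toy column:

```python
import pandas as pd

edu = pd.DataFrame({"Education": ["Basic", "Graduation", "Master", "PhD", "2n Cycle"]})
# categories are sorted lexically, so '2n Cycle' is first and gets dropped
dums = pd.get_dummies(edu, drop_first=True)
print(list(dums.columns))
# ['Education_Basic', 'Education_Graduation', 'Education_Master', 'Education_PhD']
```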
1. What are the summary statistics of the data? Explore summary statistics for the numerical and categorical variables.
We note a large number of dummy-variable columns whose values are 0 or 1; for these, the mean is simply the fraction of customers with that attribute (e.g. 0.9% of customers have complained).
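For a 0/1 indicator the mean is exactly the share of ones, since sum/count = (#ones)/n. A one-line check on toy data:

```python
import pandas as pd

complain = pd.Series([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])  # toy indicator column
print(complain.mean())  # 0.1, i.e. 10% of these toy customers complained
```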
2. How many unique observations are there in each category of the categorical columns? Write your findings/observations/insights.
data['Marital_Status'].value_counts(), data['Marital_Status'].value_counts()/len(data)
(Married 868 Together 580 Single 483 Divorced 232 Widow 77 Name: Marital_Status, dtype: int64, Married 0.387500 Together 0.258929 Single 0.215625 Divorced 0.103571 Widow 0.034375 Name: Marital_Status, dtype: float64)
data['Education'].value_counts(), data['Education'].value_counts()/len(data)
(Graduation 1127 PhD 486 Master 370 2n Cycle 203 Basic 54 Name: Education, dtype: int64, Graduation 0.503125 PhD 0.216964 Master 0.165179 2n Cycle 0.090625 Basic 0.024107 Name: Education, dtype: float64)
The largest segment of our customer base (868 customers, 39%) is married, followed by "Together" (580, 26%) and "Single" (483, 22%).
The largest education segment (1127 customers, 50%) holds a bachelor's degree ("Graduation"), while 486 of them (22%) hold PhDs.
3. Are all categories different from each other or can we combine some categories? Is 2n Cycle different from Master?
Though 2n Cycle is essentially the European equivalent of a Masters, we leave the categories distinct in case they carry some predictive power, if only as a stand-in for country.
4. There are 8 categories in Marital_Status with some categories having very low count of less than 5. Can we combine these categories with other categories?
Yes, we have treated this condition. "Alone" was changed to "Single" while "Absurd" and "YOLO" were converted to "Married" as the modal Marital_Status.
Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables.
Leading Questions:
data[data['Income']>120000]
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | ... | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 164 | 1973.0 | PhD | Married | 157243.0 | 0 | 1 | 98 | 20 | 2 | 1582 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2014 |
| 617 | 1976.0 | PhD | Together | 162397.0 | 1 | 1 | 31 | 85 | 1 | 16 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2013 |
| 655 | 1975.0 | Graduation | Divorced | 153924.0 | 0 | 0 | 81 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2014 |
| 687 | 1982.0 | PhD | Married | 160803.0 | 0 | 0 | 21 | 55 | 16 | 1622 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2012 |
| 1300 | 1971.0 | Master | Together | 157733.0 | 1 | 0 | 37 | 39 | 1 | 9 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2013 |
| 1653 | 1977.0 | Graduation | Together | 157146.0 | 0 | 0 | 13 | 1 | 0 | 1725 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2013 |
| 2132 | 1949.0 | PhD | Married | 156924.0 | 0 | 0 | 85 | 2 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2013 |
7 rows × 26 columns
These 7 outliers represent about 0.3% of the data, so the boxplot's right whisker sits near the 99.7th percentile. We elect to replace these incomes with draws normally distributed around the median.
indices_to_be_updated=data[data['Income']>120000].index
data.loc[indices_to_be_updated, 'Income']=np.random.normal(meanIncome, stdIncome, len(indices_to_be_updated))
numerical_cols = [col for col in data.columns if pd.api.types.is_numeric_dtype(data[col])]
for col in numerical_cols:
print(col)
print('Skew :', round(data[col].skew(), 2))
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
data[col].hist(bins = 10, grid = False)
plt.ylabel('count')
plt.subplot(1, 2, 2)
sns.boxplot(x = data[col])
plt.show()
Year_Birth Skew : -0.09
Income Skew : 0.01
Kidhome Skew : 0.64
Teenhome Skew : 0.41
Recency Skew : -0.0
MntWines Skew : 1.18
MntFruits Skew : 2.1
MntMeatProducts Skew : 2.08
MntFishProducts Skew : 1.92
MntSweetProducts Skew : 2.14
MntGoldProds Skew : 1.89
NumDealsPurchases Skew : 2.42
NumWebPurchases Skew : 1.38
NumCatalogPurchases Skew : 1.88
NumStorePurchases Skew : 0.7
NumWebVisitsMonth Skew : 0.21
AcceptedCmp3 Skew : 3.29
AcceptedCmp4 Skew : 3.24
AcceptedCmp5 Skew : 3.29
AcceptedCmp1 Skew : 3.56
AcceptedCmp2 Skew : 8.62
Complain Skew : 10.19
Response Skew : 1.97
Year Skew : -0.04
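For reference, pandas' `Series.skew` reports the adjusted Fisher-Pearson (sample) skewness coefficient; a small numpy sketch verifying the formula on toy numbers:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 10.0])
n = len(x)
m2 = ((x - x.mean()) ** 2).mean()          # second central moment
m3 = ((x - x.mean()) ** 3).mean()          # third central moment
g1 = m3 / m2 ** 1.5                        # population skewness
G1 = g1 * np.sqrt(n * (n - 1)) / (n - 2)   # small-sample adjustment pandas applies
print(round(G1, 2), round(x.skew(), 2))    # the two values agree
```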
We note the following about the various columns:
fig = plt.figure(figsize = (28, 12))
sns.heatmap(data_dums.corr(), annot = True);
plt.xticks(rotation = 70);
sns.catplot(data, y='Year_Birth', x='Marital_Status', hue='Education')
<seaborn.axisgrid.FacetGrid at 0x7a251605f8b0>
sns.boxplot(data, y='Year_Birth', x='Marital_Status', hue='Education')
<Axes: xlabel='Marital_Status', ylabel='Year_Birth'>
Key observations: PhDs and Masters tend to be older, while customers with Basic education tend to be younger. The 2n Cycle customers seem consistently a bit older than their Master counterparts, which suggests keeping them as separate categories is worthwhile.
sns.boxplot(data, y='Income', x='Education', hue='Marital_Status')
<Axes: xlabel='Education', ylabel='Income'>
The only notable income difference across education levels is that Basic education earns much less. We also note that widows earn more, but they are also older, so this is expected (corr(Income, Year_Birth) = -0.2).
Let's see who does the shopping:
plt.figure(figsize=(15, 7))
plt.subplot(1, 2, 1)
sns.boxplot(data, y='Income', x='NumWebPurchases')
plt.subplot(1, 2, 2)
sns.boxplot(data, y='Income', x='NumStorePurchases')
plt.show()
plt.figure(figsize=(15, 7))
plt.subplot(1, 2, 1)
sns.boxplot(data, x='NumWebVisitsMonth', y='Income')
plt.subplot(1, 2, 2)
sns.scatterplot(data, y='NumWebVisitsMonth', x='Income', hue="Education")
plt.tight_layout()
plt.show()
These two plots tell complementary stories. The first shows that customers with fewer web visits per month tend to have higher incomes, while the heavily right-skewed tail of frequent visitors earns much less. The scatter plot at right shows the same: only the lowest-income customers visit the website more than 10 times per month. This seems to contradict the earlier finding that the highest-income individuals made the most online purchases.
sns.jointplot(data, x='NumWebVisitsMonth', y='NumWebPurchases' ,hue="Education");
Unfortunately, correlation is not robust to outliers, so these 9 points might account for the entirety of the -0.52 correlation. Let's check.
NumWebPurchases_no_outliers=data[data['NumWebPurchases']<15]['NumWebPurchases']
NumWebVisitsMonth_no_outliers=data[data['NumWebVisitsMonth']<15]['NumWebVisitsMonth']
print(NumWebPurchases_no_outliers.corr(NumWebVisitsMonth_no_outliers))
sns.scatterplot(x=NumWebVisitsMonth_no_outliers, y=NumWebPurchases_no_outliers)
-0.010576900440938886
<Axes: xlabel='NumWebVisitsMonth', ylabel='NumWebPurchases'>
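Note that the two filtered Series above have different indices; `Series.corr` aligns on index first, so only rows passing both filters contribute to the correlation. A minimal demonstration:

```python
import pandas as pd

a = pd.Series([2.0, 3.0, 4.0, 100.0], index=[0, 1, 2, 3])
b = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])  # row 3 "filtered out"
# corr is computed over the common index {0, 1, 2}; the 100.0 never participates
print(a.corr(b))  # 1.0
```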
The heatmap above is a bit unwieldy, but it may prove useful to look at two subsets of it:
cols_with_Mnt = [col for col in data.columns if "Mnt" in col]
cols_with_Mnt_Edu = [col for col in data.columns if "Mnt" in col or 'Education' in col]
cols_with_accepted = [col for col in data.columns if "Accepted" in col or "Response" in col]
print(cols_with_Mnt)
print(cols_with_accepted)
['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds'] ['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Response']
fig = plt.figure(figsize = (18, 6))
sns.heatmap(data[cols_with_Mnt].corr(), annot = True);
plt.xticks(rotation = 45);
sns.pairplot(data[cols_with_Mnt_Edu], corner=True, hue='Education')
<seaborn.axisgrid.PairGrid at 0x7a2514d84d90>
fig = plt.figure(figsize = (18, 6))
sns.heatmap(data[cols_with_accepted].corr(), annot = True);
plt.xticks(rotation = 45);
Not a whole lot of information here, other than that responding to the last campaign was modestly correlated with accepting any prior campaign, though the third campaign also seems oddly unique in its strategy, being uncorrelated with the other campaigns.
means=data[cols_with_accepted].mean()
plt.xticks(rotation = 45);
plt.ylabel("Success rate")
sns.barplot(means)
<Axes: ylabel='Success rate'>
Campaign 2 was the least successful. The last campaign was the most successful. Let's break this down by demographic.
for col in cols_with_accepted:
print(col)
demo_data=data.pivot_table(index='Marital_Status', columns='Education', values=col)
plt.figure(figsize = (5, 5)) # To resize the plot
sns.heatmap(demo_data, cmap = 'viridis', annot = True)
plt.show()
AcceptedCmp3
AcceptedCmp4
AcceptedCmp5
AcceptedCmp1
AcceptedCmp2
Response
demo_data=data.pivot_table(index='Marital_Status', columns='Education', values='NumDealsPurchases')
plt.figure(figsize = (5, 5)) # To resize the plot
sns.heatmap(demo_data, cmap = 'viridis', annot = True)
plt.show()
In this section, we will first prepare our dataset for analysis.
Think About It:
new_cols=[]
#current customer age
data['Age']=2016-data['Year_Birth']
new_cols.append('Age')#keeping a running list of my new cols for EDA
data.drop('Year_Birth', axis=1, inplace=True)
#total kids
data['TotDependents']=data['Kidhome']+data['Teenhome']
new_cols.append('TotDependents')
#family size
data['Adults']=data.apply(lambda x: 1 if x['Marital_Status'] in ['Single', 'Widow', 'Divorced'] else 2, axis=1)  # one adult for Single/Widow/Divorced households, two otherwise
data['FamilySize']=data['TotDependents']+data['Adults']
new_cols.append('FamilySize')
#total spent
data['TotSpent']=data[cols_with_Mnt].sum(axis=1)
new_cols.append('TotSpent')
#customer longevity
data['Longevity']=2016-data['Year']
new_cols.append('Longevity')
#total offers accepted
#data[cols_with_accepted].corr()>0.5
print(cols_with_accepted)
['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Response']
data['TotAccepted']=data[cols_with_accepted].sum(axis=1)
new_cols.append('TotAccepted')
#Total number purchases
data['TotNumPurchased']=data['NumCatalogPurchases']+data['NumStorePurchases']+data['NumWebPurchases']
new_cols.append('TotNumPurchased')
#Spent per purchase, Divide by 0 exception set to 0
data['AvgPerPurchase']=data.apply(lambda x: 0 if x['TotNumPurchased']==0 else x['TotSpent'] / x['TotNumPurchased'], axis=1)
new_cols.append('AvgPerPurchase')
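A vectorized alternative to the row-wise apply for the divide-by-zero guard, sketched on toy values:

```python
import pandas as pd

df = pd.DataFrame({"TotSpent": [100.0, 0.0, 90.0], "TotNumPurchased": [4, 0, 3]})
avg = pd.Series(0.0, index=df.index)           # default 0 where no purchases
bought = df["TotNumPurchased"] > 0
# divide only the rows with at least one purchase; the rest keep the 0 default
avg[bought] = df.loc[bought, "TotSpent"] / df.loc[bought, "TotNumPurchased"]
print(avg.tolist())  # [25.0, 0.0, 30.0]
```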
data['AvgPerPurchase'].min()
0.0
Statistics for engineered features.
for col in new_cols:
print(col)
print('Skew :', round(data[col].skew(), 2))
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
data[col].hist(bins = 10, grid = False)
plt.ylabel('count')
plt.subplot(1, 2, 2)
sns.boxplot(x = data[col])
plt.show()
Age Skew : 0.09
TotDependents Skew : 0.42
FamilySize Skew : 0.18
TotSpent Skew : 0.86
Longevity Skew : 0.04
TotAccepted Skew : 2.43
TotNumPurchased Skew : 0.3
AvgPerPurchase Skew : 20.75
Replacing extreme outliers in AvgPerPurchase (values above 500) with the mean.
data.loc[data['AvgPerPurchase'] > 500, 'AvgPerPurchase']=data['AvgPerPurchase'].mean()
data[data['AvgPerPurchase']>500]['AvgPerPurchase']
Series([], Name: AvgPerPurchase, dtype: float64)
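Chained indexing like `data[mask][col] = value` assigns into a temporary copy and is lost; `.loc`, as used above, selects rows and column in one operation so the write lands in the frame itself. A toy contrast:

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 900.0, 3.0]})
# df[df["x"] > 500]["x"] = 0.0 would modify a temporary copy, leaving df unchanged;
# .loc with a boolean row mask and a column label writes into df directly
df.loc[df["x"] > 500, "x"] = 0.0
print(df["x"].tolist())  # [1.0, 0.0, 3.0]
```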
print('Skew :', round(data['AvgPerPurchase'].skew(), 2))
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
data['AvgPerPurchase'].hist(bins = 10, grid = False)
plt.ylabel('count')
plt.subplot(1, 2, 2)
sns.boxplot(x = data['AvgPerPurchase'])
plt.show()
Skew : 1.3
What are the most important observations and insights from the data based on the EDA and data preprocessing performed?
Separating out behavior columns: note that cols_with_Mnt and cols_with_accepted are already subsets of behavior. To these we add Recency, all Num*Purchases counts, NumWebVisitsMonth, and Complain, plus the new columns AvgPerPurchase, TotSpent, and TotAccepted.
cols_behavior = cols_with_Mnt+cols_with_accepted+[col for col in data.columns if "Num" in col]+['TotAccepted','Complain','AvgPerPurchase','TotSpent', 'Recency']
print(cols_behavior)
['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Response', 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'TotNumPurchased', 'TotAccepted', 'Complain', 'AvgPerPurchase', 'TotSpent', 'Recency']
We also need demographic columns: TotDependents, Age, FamilySize, Income, plus the Education and Marital_Status dummies.
data_dums=pd.get_dummies(data, drop_first=True)
cols_demo=['TotDependents', 'Age', 'FamilySize', 'Income']+[cols for cols in data_dums.columns if "Status" in cols]+[cols for cols in data_dums.columns if "Education" in cols]
data_behavior=data[cols_behavior]
data_demo=data_dums[cols_demo]
plt.figure(figsize = (20, 20))
sns.heatmap(data_behavior.corr(), annot = True, cmap = "YlGnBu")
plt.show()
Looking at all the behavior columns, there are some notably strong correlations:
From this, we conclude that wine and meat were more expensive than other products and that more of these items were bought through the catalog than through other means.
Meat purchases were mostly done through catalog and total number of purchases did not vary strongly by means (web, catalog, or store).
Very little substitution effect: big shoppers in one category were also big shoppers in the other categories.
On the other hand, AcceptedCmp2, AcceptedCmp3, Recency, and Complain were all poorly correlated with everything else, so we will drop them.
data_behavior = data_behavior.drop(columns=['AcceptedCmp2', 'AcceptedCmp3', 'Recency', 'Complain'])  # reassign rather than drop inplace on a slice of data
scaler=StandardScaler()
data_behavior_scaled=pd.DataFrame(scaler.fit_transform(data_behavior), columns=data_behavior.columns)
data_behavior_scaled.head()
| MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | Response | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | TotNumPurchased | TotAccepted | AvgPerPurchase | TotSpent | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.983781 | 1.551577 | 1.679702 | 2.462147 | 1.476500 | 0.843207 | -0.28383 | -0.28014 | -0.262111 | 2.388846 | 0.349414 | 1.409304 | 2.510890 | -0.550785 | 0.693904 | 1.313544 | 0.623174 | 1.201453 | 1.679417 |
| 1 | -0.870479 | -0.636301 | -0.713225 | -0.650449 | -0.631503 | -0.729006 | -0.28383 | -0.28014 | -0.262111 | -0.418612 | -0.168236 | -1.110409 | -0.568720 | -1.166125 | -0.130463 | -1.185022 | -0.502560 | -1.018283 | -0.961275 |
| 2 | 0.362723 | 0.570804 | -0.177032 | 1.345274 | -0.146905 | -0.038766 | -0.28383 | -0.28014 | -0.262111 | -0.418612 | -0.685887 | 1.409304 | -0.226541 | 1.295237 | -0.542647 | 1.035926 | -0.502560 | 0.047523 | 0.282673 |
| 3 | -0.870479 | -0.560857 | -0.651187 | -0.503974 | -0.583043 | -0.748179 | -0.28383 | -0.28014 | -0.262111 | -0.418612 | -0.168236 | -0.750450 | -0.910898 | -0.550785 | 0.281720 | -0.907403 | -0.502560 | -0.949003 | -0.918094 |
| 4 | -0.389085 | 0.419916 | -0.216914 | 0.155164 | -0.001525 | -0.556446 | -0.28383 | -0.28014 | -0.262111 | -0.418612 | 1.384715 | 0.329427 | 0.115638 | 0.064556 | -0.130463 | 0.203070 | -0.502560 | -0.240366 | -0.305254 |
data_behavior_scaled.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| MntWines | 2240.0 | -7.612958e-17 | 1.000223 | -0.903167 | -0.832592 | -0.387599 | 0.595248 | 3.533390 |
| MntFruits | 2240.0 | -2.379049e-17 | 1.000223 | -0.661449 | -0.636301 | -0.460265 | 0.168436 | 4.343008 |
| MntMeatProducts | 2240.0 | 4.123686e-17 | 1.000223 | -0.739813 | -0.668912 | -0.442913 | 0.288259 | 6.904261 |
| MntFishProducts | 2240.0 | 1.506731e-17 | 1.000223 | -0.687068 | -0.632140 | -0.467355 | 0.228401 | 4.055064 |
| MntSweetProducts | 2240.0 | -1.506731e-17 | 1.000223 | -0.655733 | -0.631503 | -0.461894 | 0.143854 | 5.716737 |
| MntGoldProds | 2240.0 | -1.110223e-17 | 1.000223 | -0.844046 | -0.671486 | -0.383886 | 0.229660 | 6.096698 |
| AcceptedCmp4 | 2240.0 | 5.551115e-17 | 1.000223 | -0.283830 | -0.283830 | -0.283830 | -0.283830 | 3.523233 |
| AcceptedCmp5 | 2240.0 | -3.013462e-17 | 1.000223 | -0.280140 | -0.280140 | -0.280140 | -0.280140 | 3.569640 |
| AcceptedCmp1 | 2240.0 | 6.264830e-17 | 1.000223 | -0.262111 | -0.262111 | -0.262111 | -0.262111 | 3.815174 |
| Response | 2240.0 | -7.295751e-17 | 1.000223 | -0.418612 | -0.418612 | -0.418612 | -0.418612 | 2.388846 |
| NumDealsPurchases | 2240.0 | -9.833404e-17 | 1.000223 | -1.203537 | -0.685887 | -0.168236 | 0.349414 | 6.561217 |
| NumWebPurchases | 2240.0 | -5.709718e-17 | 1.000223 | -1.470368 | -0.750450 | -0.030532 | 0.689386 | 8.248526 |
| NumCatalogPurchases | 2240.0 | 9.516197e-18 | 1.000223 | -0.910898 | -0.910898 | -0.226541 | 0.457817 | 8.670110 |
| NumStorePurchases | 2240.0 | 7.612958e-17 | 1.000223 | -1.781466 | -0.858455 | -0.243114 | 0.679896 | 2.218248 |
| NumWebVisitsMonth | 2240.0 | -6.344132e-17 | 1.000223 | -2.191381 | -0.954831 | 0.281720 | 0.693904 | 6.052291 |
| TotNumPurchased | 2240.0 | -1.118153e-16 | 1.000223 | -1.740259 | -0.907403 | -0.074548 | 0.758307 | 2.701637 |
| TotAccepted | 2240.0 | -2.537653e-17 | 1.000223 | -0.502560 | -0.502560 | -0.502560 | 0.623174 | 5.126107 |
| AvgPerPurchase | 2240.0 | -1.110223e-16 | 1.000223 | -1.242751 | -0.810443 | -0.256993 | 0.389004 | 4.998005 |
| TotSpent | 2240.0 | 3.647876e-17 | 1.000223 | -0.997813 | -0.891937 | -0.348436 | 0.730262 | 3.187435 |
# Creating copy of the data to store labels from each algorithm
data_behavior_scaled_copy = data_behavior_scaled.copy(deep = True)
# Defining the number of principal components to generate
n = data_behavior_scaled.shape[1]
print("Max number of principal components", n)
# Finding principal components for the data
cols=['PC1','PC2']
pca1 = PCA(n_components = n, random_state = 1)
data_pca = pd.DataFrame(pca1.fit_transform(data_behavior_scaled))
#transformed_data=pd.DataFrame(data_pca, columns=cols)
# The percentage of variance explained by each principal component
exp_var1 = pca1.explained_variance_ratio_
Max number of principal components 19
# Visualize the explained variance by individual components
plt.figure(figsize = (10, 10))
plt.plot(range(1, n+1), pca1.explained_variance_ratio_.cumsum(), marker = 'o', linestyle = '--')
plt.title("Explained Variances by Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.show()
# Find the least number of components that can explain more than 70% variance
total = 0
for ix, i in enumerate(exp_var1):
    total += i
    print(i, total)
    if total > 0.70:
        print("Number of PCs that explain at least 70% variance: ", ix + 1)
        break
0.43315864706841356 0.43315864706841356
0.12136306680420306 0.5545217138726166
0.09061677039937804 0.6451384842719947
0.05372784202593959 0.6988663262979343
0.039727082771396215 0.7385934090693306
Number of PCs that explain at least 70% variance:  5
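As an aside (a minimal sketch on synthetic stand-in data, not the campaign dataset): scikit-learn's `PCA` also accepts a fractional `n_components`, which keeps the fewest components whose cumulative explained variance exceeds that fraction, so the loop above can be cross-checked in a single call.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Correlated stand-in for the scaled behavior data: 200 rows, 19 features
X = rng.normal(size=(200, 19)) @ rng.normal(size=(19, 19))

# A float in (0, 1) keeps the smallest number of components that
# together explain at least that share of the variance
pca = PCA(n_components=0.70, random_state=1)
pca.fit(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```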
colsall=[]
for i in range(1, ix+2):
    colsall.append("PC" + str(i))
print(colsall)
['PC1', 'PC2', 'PC3', 'PC4', 'PC5']
pc1 = pd.DataFrame(np.round(pca1.components_.T[:, 0:ix+1], 2), index = data_behavior_scaled.columns, columns = colsall)
pc1
| PC1 | PC2 | PC3 | PC4 | PC5 | |
|---|---|---|---|---|---|
| MntWines | 0.29 | 0.08 | 0.18 | 0.25 | 0.13 |
| MntFruits | 0.23 | -0.18 | -0.17 | -0.21 | -0.15 |
| MntMeatProducts | 0.28 | -0.06 | -0.17 | -0.02 | 0.40 |
| MntFishProducts | 0.24 | -0.18 | -0.18 | -0.20 | -0.17 |
| MntSweetProducts | 0.23 | -0.16 | -0.15 | -0.19 | -0.29 |
| MntGoldProds | 0.20 | -0.12 | 0.11 | -0.32 | -0.21 |
| AcceptedCmp4 | 0.10 | 0.36 | 0.17 | 0.48 | -0.19 |
| AcceptedCmp5 | 0.18 | 0.35 | -0.14 | 0.11 | 0.02 |
| AcceptedCmp1 | 0.16 | 0.32 | -0.10 | -0.09 | -0.40 |
| Response | 0.11 | 0.40 | 0.03 | -0.51 | 0.23 |
| NumDealsPurchases | -0.03 | -0.06 | 0.58 | -0.20 | 0.31 |
| NumWebPurchases | 0.20 | -0.09 | 0.45 | -0.07 | -0.31 |
| NumCatalogPurchases | 0.28 | -0.07 | -0.01 | 0.01 | 0.30 |
| NumStorePurchases | 0.25 | -0.18 | 0.19 | 0.26 | -0.15 |
| NumWebVisitsMonth | -0.20 | 0.14 | 0.38 | -0.25 | -0.08 |
| TotNumPurchased | 0.30 | -0.15 | 0.25 | 0.10 | -0.07 |
| TotAccepted | 0.19 | 0.53 | 0.00 | -0.12 | -0.06 |
| AvgPerPurchase | 0.30 | 0.02 | -0.04 | 0.01 | 0.24 |
| TotSpent | 0.34 | -0.02 | 0.01 | 0.06 | 0.16 |
Let's look at a plot of the first two of these:
pca2 = PCA(n_components = 2, random_state = 2)
data_pca = pca2.fit_transform(data_behavior_scaled)
transformed_data = pd.DataFrame(data_pca, columns = cols)
sns.scatterplot(transformed_data, x='PC1', y='PC2')
<Axes: xlabel='PC1', ylabel='PC2'>
We note that PCA's first two components, comprising 43% and 12% of the variance respectively (out of the ~74% explained by the first five components), do produce interesting patterns that we expect to come out in our clustering algorithms.
Think about it:
Observation and Insights: The first PCA dimension largely centers on how much the customer spends overall.
The second PCA dimension mostly relates to how targetable the customer is by marketing campaigns; this is obviously a highly useful dimension to know.
The third PCA dimension is largely about website usage and responsiveness to deals.
The other PCA dimensions don't have an easily interpretable meaning beyond specific products being bought (wine vs. meat, respectively).
Knowing this, we can start to make sense of our plot above. We note a gap in the lower-left corner of the plot for the first 2 PCA components. This corner would represent customers who are small spenders but still targetable by marketing campaigns. The emptiness of this corner suggests that such customers are not discretionary spenders and can't be recruited by marketing campaigns. The straight lines in this part of the chart reflect that response to campaigns is strongly limited by disposable income.
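To make this kind of loading-by-loading reading less ad hoc, one can rank features by absolute loading per component. A minimal sketch (the loadings table here is a small hypothetical stand-in, not the fitted `pc1` values):

```python
import pandas as pd

# Hypothetical loadings (rows = features, columns = PCs), mimicking pc1's shape
loadings = pd.DataFrame(
    {"PC1": [0.34, 0.30, -0.20, 0.11],
     "PC2": [-0.02, 0.02, 0.14, 0.40]},
    index=["TotSpent", "AvgPerPurchase", "NumWebVisitsMonth", "Response"],
)

# For each component, list the features ordered by absolute loading
for pc in loadings.columns:
    top = loadings[pc].abs().sort_values(ascending=False)
    print(pc, "->", list(top.index))
```

Applying the same idea to `pc1` would surface TotSpent and AvgPerPurchase for PC1 and TotAccepted and Response for PC2, matching the reading above.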
# Trying various values of perplexity
for i in range(5, 65, 7):
    tsne = TSNE(n_components = 2, random_state = 2, perplexity = i)
    data_tsne = tsne.fit_transform(data_behavior_scaled)
    data_tsne = pd.DataFrame(data_tsne)
    data_tsne.columns = ['X1','X2']
    plt.figure(figsize = (7,7))
    sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne)
    plt.title("perplexity = {}".format(i))
Observation and Insights: Most t-SNE runs at perplexity > 33 return a similar clustering topology: one large "continent" at lower X1 values (bounded by X2 = -5/2·X1 + 80) and a chain of perhaps 3 main islands at larger X1. Let's stick with perplexity = 33.
tsne = TSNE(n_components = 2, random_state = 2, perplexity = 33)
data_tsne = tsne.fit_transform(data_behavior_scaled)
data_tsne = pd.DataFrame(data_tsne)
data_tsne.columns = ['X1','X2']
plt.figure(figsize = (7,7))
sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne)
plt.title("perplexity = {}".format(33))
Text(0.5, 1.0, 'perplexity = 33')
# Let's assign points to 4 different groups
def grouping(x):
    first_component = x['X1']
    second_component = x['X2']
    if second_component < -5/2 * first_component + 80:
        return 'group 1'
    elif first_component < 40:
        return 'group_2'
    elif second_component < -15:
        return 'group_3'
    else:
        return 'group_4'
data_tsne['groups'] = data_tsne.apply(grouping, axis = 1)
# Scatter plot for two components with hue
plt.figure(figsize = (7, 7))
sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne, hue = 'groups')
<Axes: xlabel='X1', ylabel='X2'>
#data_tsne.head()
data_tsne['groups'].value_counts()
group 1    1708
group_4     259
group_2     206
group_3      67
Name: groups, dtype: int64
This highly uneven split of the groups is not ideal: group 1 alone contains 76% of all customers.
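The 76% figure comes straight from normalized counts; a quick sketch with stand-in labels matching the tallies above:

```python
import pandas as pd

# Stand-in labels reproducing the group sizes above
labels = pd.Series(["group 1"] * 1708 + ["group_4"] * 259
                   + ["group_2"] * 206 + ["group_3"] * 67)

# normalize=True turns counts into shares of the total
shares = labels.value_counts(normalize=True)
print(shares.round(3))
```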
all_col = data_behavior_scaled.columns[:].tolist()
plt.figure(figsize = (30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(8, 3, i + 1)
    sns.boxplot(y=data_behavior_scaled[variable], x=data_tsne['groups'])
    plt.tight_layout()
    plt.title(variable)
plt.show()
# importing plotly
import plotly.express as px
all_col = data_demo.columns[:].tolist()
plt.figure(figsize = (30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(8, 3, i + 1)
    sns.boxplot(y=data_demo[variable], x=data_tsne['groups'])
    # px.box(data_demo, y=data_demo[variable], x=data_tsne['groups'], points="all")
    plt.tight_layout()
    plt.title(variable)
plt.show()
The strongest separation seems to happen by number of web visits:
Think About It:
# Empty dictionary to store the SSE for each value of K
sse = {}
# Iterate for a range of Ks and fit the scaled data to the algorithm.
# Use inertia attribute from the clustering object and store the inertia value for that K
for k in range(1, 10):
    kmeans = KMeans(n_clusters = k, random_state = 1).fit(data_pca)
    sse[k] = kmeans.inertia_
# Elbow plot
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()), 'bx-')
plt.xlabel("Number of cluster")
plt.ylabel("SSE")
plt.show()
Given the sharp elbow at k=3, 3 clusters seems ideal.
# Empty dictionary to store the Silhouette score for each value of K
sc = {}
# Iterate for a range of Ks and fit the scaled data to the algorithm. Store the Silhouette score for that K
for k in range(2, 10):
    kmeans = KMeans(n_clusters = k, random_state = 1).fit(data_pca)
    labels = kmeans.predict(data_pca)
    sc[k] = silhouette_score(data_pca, labels)
# Elbow plot
plt.figure()
plt.plot(list(sc.keys()), list(sc.values()), 'bx-')
plt.xlabel("Number of cluster")
plt.ylabel("Silhouette Score")
plt.show()
k=3 is our best clustering for KMeans.
cols = ['PC1', 'PC2', 'PC3']
pca2 = PCA(n_components = 3, random_state = 1)
data_pca = pca2.fit_transform(data_behavior_scaled)
transformed_data = pd.DataFrame(data_pca, columns = cols)
transformed_data_copy = data_behavior_scaled.copy(deep = True)
kmeans = KMeans(n_clusters = 3, random_state = 1)
kmeans_trans = KMeans(n_clusters = 3, random_state = 1)
kmeans.fit(data_behavior_scaled)
kmeans_trans.fit(transformed_data)
# Adding predicted labels to the original data and the scaled data
data_behavior_scaled_copy['KMeans_Labels'] = kmeans.predict(data_behavior_scaled)
transformed_data_copy['KMeans_Labels'] = kmeans_trans.predict(transformed_data)
data['KMeans_Labels'] = kmeans.predict(data_behavior_scaled)
transformed_data['KMeans_Labels'] = kmeans_trans.predict(transformed_data)
data['KMeans_Labels'].value_counts()
0    1217
2     803
1     220
Name: KMeans_Labels, dtype: int64
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='KMeans_Labels', palette='Dark2')
<Axes: xlabel='PC1', ylabel='PC2'>
all_col = data_behavior_scaled.columns[:].tolist()
plt.figure(figsize = (30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(8, 3, i + 1)
    sns.boxplot(y=data_behavior_scaled[variable], x=transformed_data['KMeans_Labels'])
    plt.tight_layout()
    plt.title(variable)
plt.show()
all_col = data_demo.columns[:].tolist()
plt.figure(figsize = (30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(8, 3, i + 1)
    sns.boxplot(y=data_demo[variable], x=transformed_data['KMeans_Labels'])
    plt.tight_layout()
    plt.title(variable)
plt.show()
Observations and Insights:
k=3 provided 3 clusters of data, but groups 0 and 1 don't split graphically as expected. The most conspicuous split is by income: Groups 0, 1, and 2 are low, medium, and high-income, respectively.
Think About It:
Summary of each cluster: We see a grouping similar to what t-SNE produced, in that the groups are most strongly stratified by web visits and one group responds most strongly to campaigns.
kmedo = KMedoids(n_clusters = 3, random_state = 5)
kmedo.fit(transformed_data)
data_behavior_scaled_copy['kmedoLabels'] = kmedo.predict(transformed_data)
data['kmedoLabels'] = kmedo.predict(transformed_data)
transformed_data['kmedoLabels']=kmedo.predict(transformed_data)
data['KMeans_Labels'].value_counts()
0    1217
2     803
1     220
Name: KMeans_Labels, dtype: int64
data.kmedoLabels.value_counts()
2    1079
1     582
0     579
Name: kmedoLabels, dtype: int64
The clusters are a bit more balanced than with KMeans.
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 2)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='kmedoLabels', palette='Dark2')
plt.subplot(1, 2, 1)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='KMeans_Labels', palette='Dark2')
plt.show()
Similar clustering as before, mostly by income and amount spent, but with KMedoids the left-most group grew from 220 to 1079, the right-most group shrank from 803 to 582, and the middle group shrank from 1217 to 579.
for col in data_behavior.columns:
    sns.boxplot(x = 'kmedoLabels', y = col, data = data)
    plt.show()
all_col = data_demo.columns[:-4].tolist()
plt.figure(figsize = (30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(8, 3, i + 1)
    sns.boxplot(y=data_demo[variable], x=transformed_data['kmedoLabels'])
    plt.tight_layout()
    plt.title(variable)
plt.show()
Observations and Insights:
Summary for each cluster:
Observations and Insights:
from scipy.cluster.hierarchy import dendrogram, linkage
# The List of all linkage methods to check
methods = ['single',
'average',
'complete']
# Create a subplot image
fig, axs = plt.subplots(len(methods), 1, figsize = (20, 15))
# Enumerate through the list of all methods above, get linkage and plot dendrogram
for i, method in enumerate(methods):
    Z = linkage(transformed_data, metric = 'euclidean', method = method)
    dendrogram(Z, ax = axs[i])
    axs[i].set_title(f'Dendrogram ({method.capitalize()} Linkage)')
    axs[i].set_ylabel('Distance')
Think about it:
Single linkage does not cluster well in this case.
Complete linkage does an okay job, but average linkage gives the best cluster separation. Cutting at a distance of 5 yields 4 clusters.
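The cut at a given height can be converted into flat cluster labels with `scipy.cluster.hierarchy.fcluster` and `criterion='distance'`. A minimal sketch on two toy blobs (the data and the cut height t=2.0 here are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated toy blobs in 2-D
X = np.vstack([rng.normal(0, 0.2, size=(20, 2)),
               rng.normal(5, 0.2, size=(20, 2))])

Z = linkage(X, method="average", metric="euclidean")
# criterion='distance' cuts the dendrogram at height t
flat = fcluster(Z, t=2.0, criterion="distance")
print(np.unique(flat))
```

In the notebook, `fcluster(linkage(transformed_data, method='average'), t=5, criterion='distance')` should reproduce the labels of the 4-cluster cut drawn below.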
plt.figure(figsize = (20, 7))
plt.title("Dendrograms")
dend = dendrogram(linkage(transformed_data, method = 'average'))
plt.axhline(y = 5, color = 'g', linestyle = '--')
<matplotlib.lines.Line2D at 0x7a251636a860>
# Clustering with 4 clusters
hierarchical = AgglomerativeClustering(n_clusters = 4, affinity = 'euclidean', linkage = 'average')
hierarchical.fit(transformed_data)
AgglomerativeClustering(affinity='euclidean', linkage='average', n_clusters=4)
data_behavior_scaled_copy['HCLabels'] = hierarchical.labels_
data['HCLabels'] = hierarchical.labels_
transformed_data['HCLabels'] = hierarchical.labels_
data_demo['HCLabels'] = hierarchical.labels_
data.HCLabels.value_counts()
3    1077
1     927
0     233
2       3
Name: HCLabels, dtype: int64
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='KMeans_Labels', palette='Dark2')
plt.subplot(1, 2, 2)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='HCLabels', palette='Dark2')
plt.show()
Results are very similar to KMeans and KMedoids, in that we are mostly stratifying by how much the customers spend. HC also isolated 3 uniquely targetable, very high spenders into their own group, which we're calling group 2.
for col in data_behavior.columns:
    sns.boxplot(x = 'HCLabels', y = col, data = data)
    plt.show()
mean = round(data.groupby('HCLabels').mean(), 2)
median = round(data.groupby('HCLabels').median(), 2)
df_hierachical = pd.concat([mean, median], axis = 0)
df_hierachical.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_3 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median', 'group_3 Median']
df_hierachical[data_behavior.columns].T
| group_0 Mean | group_1 Mean | group_2 Mean | group_3 Mean | group_0 Median | group_1 Median | group_2 Median | group_3 Median | |
|---|---|---|---|---|---|---|---|---|
| MntWines | 833.01 | 470.78 | 26.33 | 46.64 | 847.00 | 418.00 | 5.0 | 23.00 |
| MntFruits | 62.64 | 42.03 | 6.33 | 4.96 | 44.00 | 26.00 | 1.0 | 2.00 |
| MntMeatProducts | 495.08 | 251.17 | 23.33 | 23.87 | 490.00 | 183.00 | 3.0 | 15.00 |
| MntFishProducts | 88.24 | 60.06 | 3.00 | 7.25 | 69.00 | 39.00 | 1.0 | 3.00 |
| MntSweetProducts | 68.20 | 42.29 | 4.67 | 5.12 | 51.00 | 27.00 | 1.0 | 2.00 |
| MntGoldProds | 74.97 | 69.64 | 1.33 | 15.40 | 55.00 | 50.00 | 1.0 | 10.00 |
| AcceptedCmp4 | 0.36 | 0.07 | 0.00 | 0.01 | 0.00 | 0.00 | 0.0 | 0.00 |
| AcceptedCmp5 | 0.65 | 0.01 | 0.00 | 0.00 | 1.00 | 0.00 | 0.0 | 0.00 |
| AcceptedCmp1 | 0.51 | 0.03 | 0.00 | 0.00 | 1.00 | 0.00 | 0.0 | 0.00 |
| Response | 0.60 | 0.11 | 0.00 | 0.08 | 1.00 | 0.00 | 0.0 | 0.00 |
| NumDealsPurchases | 1.17 | 2.91 | 15.00 | 2.03 | 1.00 | 2.00 | 15.0 | 2.00 |
| NumWebPurchases | 5.40 | 6.00 | 0.00 | 2.16 | 5.00 | 6.00 | 0.0 | 2.00 |
| NumCatalogPurchases | 5.97 | 4.26 | 0.33 | 0.58 | 6.00 | 4.00 | 0.0 | 0.00 |
| NumStorePurchases | 8.02 | 8.13 | 0.00 | 3.31 | 8.00 | 8.00 | 0.0 | 3.00 |
| NumWebVisitsMonth | 3.17 | 4.63 | 19.33 | 6.34 | 3.00 | 5.00 | 19.0 | 7.00 |
| TotNumPurchased | 19.39 | 18.38 | 0.33 | 6.06 | 19.00 | 18.00 | 0.0 | 6.00 |
| TotAccepted | 2.37 | 0.29 | 0.00 | 0.16 | 2.00 | 0.00 | 0.0 | 0.00 |
| AvgPerPurchase | 87.18 | 51.00 | 59.33 | 14.81 | 82.11 | 42.33 | 0.0 | 12.67 |
| TotSpent | 1622.14 | 935.97 | 65.00 | 103.24 | 1650.00 | 908.00 | 9.0 | 66.00 |
Group 2: the 3 people in group 2 are "Deal David." Deal David is an online window shopper waiting for a deal. He is 5 times as likely to respond to a deal as anyone in any other category. And while he might spend a lot less overall, his average spend per purchase is higher than almost anyone else's.
Group 0 is "Heavy Spender Henry." He buys more of every product category than anyone else, and he is very targetable with campaigns.
Group 1 is a bit thrifty: "Economical Esther." She spends more than Deal David from group 2 and makes nearly as many purchases as Heavy Spender Henry, but tends to make cheaper purchases in virtually every category.
Group 3 is the largest group, with 1077 of our customers. This is our "Average Joe." Joe does not respond to campaigns, does not buy many items, and does not spend much. Joe should be a last priority in marketing campaigns.
for col in data_demo.columns:
    sns.boxplot(x = 'HCLabels', y = col, data = data_demo)
    plt.show()
Observations and Insights:
Summary of each cluster:
DBSCAN is a very powerful algorithm for finding high-density clusters, but the difficulty lies in choosing its two hyperparameters, eps and min_samples.
Since clustering is unsupervised, we cannot score candidate settings on a validation set the way a supervised learner can. The approach we can follow is to try a range of hyperparameter combinations and compute the silhouette score for each.
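A common complement to this grid search is the k-distance heuristic for eps: sort every point's distance to its k-th nearest neighbor and look for the "knee" in the curve. A sketch on toy data (the blobs and k here are illustrative; reading off the knee is still a judgment call):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Two dense toy blobs
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)),
               rng.normal(4, 0.3, size=(100, 2))])

k = 5  # conventionally matched to DBSCAN's min_samples
# +1 because each point's nearest neighbor in the fitted set is itself
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)

# Sorted k-th-neighbor distances; a sharp bend suggests a value for eps
k_dists = np.sort(dists[:, -1])
print(k_dists[0], k_dists[-1])
```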
# Empty dictionary to store the silhouette score for each value of eps
sc = {}
# Iterate over a range of eps values; fit the data and store the silhouette score
for eps in np.arange(0.5, 5.5, 0.5):
    dbs = DBSCAN(eps=eps)
    labels = dbs.fit_predict(transformed_data)
    # Silhouette score is only defined when there are at least 2 labels
    if len(np.unique(labels)) > 1:
        sc[eps] = silhouette_score(transformed_data, labels)
# Elbow plot
plt.figure()
plt.plot(list(sc.keys()), list(sc.values()), 'bx-')
plt.xlabel("Eps")
plt.ylabel("Silhouette Score")
plt.show()
eps=1.5 seems to work best, though anything in the range 1.5-2.5 might be plausible in case we've hit a saddle point. It might also be that eps=1.5 wins solely because of the default min_samples. Let's tune min_samples similarly.
# Empty dictionary to store the silhouette score for each value of min_samples
sc = {}
# Iterate over a range of min_samples values; fit the data and store the silhouette score
for min_samples in range(2, 20):
    dbs = DBSCAN(eps=1.5, min_samples=min_samples)
    labels = dbs.fit_predict(transformed_data)
    if len(np.unique(labels)) > 1:
        sc[min_samples] = silhouette_score(transformed_data, labels)
# Elbow plot
plt.figure()
plt.plot(list(sc.keys()), list(sc.values()), 'bx-')
plt.xlabel("min_samples")
plt.ylabel("Silhouette Score")
plt.show()
min_samples=5 (the default) works well. min_samples=3 scores slightly better (silhouette = 0.56) but does not change the clustering's appearance (see below) and might be an overfit.
For eps=2.5, min_samples=20 works better (silhouette = 0.56); for eps=2, min_samples=5 again works best (silhouette = 0.56).
For all values of eps between 1.5 and 2.5, we have checked that the clustering results are robust.
dbs = DBSCAN(eps = 1.5, min_samples = 5)
# Fit once and reuse the labels (DBSCAN has no separate predict step)
labels = dbs.fit_predict(transformed_data)
data_behavior_scaled_copy['DBSLabels'] = labels
data['DBSLabels'] = labels
transformed_data['DBSLabels'] = labels
data_demo['DBSLabels'] = labels
data['DBSLabels'].value_counts()
 1    1077
 0     922
 2     228
-1      13
Name: DBSLabels, dtype: int64
Coincidentally, we also arrive at 4 groups (3 clusters plus DBSCAN's noise label, -1). Let's see how it looks on the first two PCA components.
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='KMeans_Labels', palette='Dark2')
plt.subplot(1, 2, 2)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='DBSLabels', palette='Dark2')
plt.show()
Observations and Insights: Similar clustering to before, except that DBSCAN has set aside 13 customers under label -1, its designated noise label (points that don't lie in any dense region). Because this is only the first two dimensions of PCA, maybe we're missing something in the next dimension that the DBSCAN algorithm is picking up for these points.
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
sns.scatterplot(y='PC2', x='PC3', data=transformed_data, hue='KMeans_Labels', palette='Dark2')
plt.subplot(1, 2, 2)
sns.scatterplot(y='PC2', x='PC3', data=transformed_data, hue='DBSLabels', palette='Dark2')
plt.show()
PC3 also shows a similar scattering of this -1 category that does not admit a simple explanation.
mean = round(data.groupby('DBSLabels').mean(), 2)
median = round(data.groupby('DBSLabels').median(), 2)
df_hierachical = pd.concat([mean, median], axis = 0)
df_hierachical.index = ['group_-1 Mean', 'group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_-1 Median', 'group_0 Median', 'group_1 Median', 'group_2 Median']
df_hierachical[data_behavior.columns].T
| group_-1 Mean | group_0 Mean | group_1 Mean | group_2 Mean | group_-1 Median | group_0 Median | group_1 Median | group_2 Median | |
|---|---|---|---|---|---|---|---|---|
| MntWines | 595.46 | 468.15 | 46.64 | 838.64 | 408.00 | 415.50 | 23.00 | 851.50 |
| MntFruits | 56.23 | 42.03 | 4.96 | 61.82 | 14.00 | 26.00 | 2.00 | 43.50 |
| MntMeatProducts | 359.00 | 248.80 | 23.87 | 500.89 | 161.00 | 181.50 | 15.00 | 500.00 |
| MntFishProducts | 51.23 | 60.03 | 7.25 | 88.74 | 12.00 | 39.00 | 3.00 | 69.00 |
| MntSweetProducts | 46.69 | 42.35 | 5.12 | 67.78 | 12.00 | 26.50 | 2.00 | 51.00 |
| MntGoldProds | 35.15 | 69.64 | 15.40 | 76.13 | 22.00 | 50.00 | 10.00 | 56.50 |
| AcceptedCmp4 | 0.31 | 0.07 | 0.01 | 0.36 | 0.00 | 0.00 | 0.00 | 0.00 |
| AcceptedCmp5 | 0.31 | 0.01 | 0.00 | 0.66 | 0.00 | 0.00 | 0.00 | 1.00 |
| AcceptedCmp1 | 0.46 | 0.03 | 0.00 | 0.50 | 0.00 | 0.00 | 0.00 | 0.00 |
| Response | 0.62 | 0.11 | 0.08 | 0.59 | 1.00 | 0.00 | 0.00 | 1.00 |
| NumDealsPurchases | 5.31 | 2.91 | 2.03 | 1.15 | 2.00 | 2.00 | 2.00 | 1.00 |
| NumWebPurchases | 4.69 | 5.99 | 2.16 | 5.43 | 4.00 | 6.00 | 2.00 | 5.00 |
| NumCatalogPurchases | 3.62 | 4.25 | 0.58 | 6.00 | 2.00 | 4.00 | 0.00 | 6.00 |
| NumStorePurchases | 5.08 | 8.13 | 3.31 | 8.08 | 6.00 | 8.00 | 3.00 | 8.00 |
| NumWebVisitsMonth | 8.15 | 4.62 | 6.34 | 3.14 | 7.00 | 5.00 | 7.00 | 3.00 |
| TotNumPurchased | 13.38 | 18.37 | 6.06 | 19.50 | 15.00 | 18.00 | 6.00 | 19.00 |
| TotAccepted | 1.92 | 0.29 | 0.16 | 2.33 | 1.00 | 0.00 | 0.00 | 2.00 |
| AvgPerPurchase | 65.32 | 50.90 | 14.81 | 87.68 | 70.74 | 42.33 | 12.67 | 82.06 |
| TotSpent | 1143.77 | 931.00 | 103.24 | 1633.99 | 1662.00 | 907.00 | 66.00 | 1654.00 |
Of course, averages will tell us little of use about group -1: even though its points are scattered widely, their averages still settle toward the center of each dimension (an echo of the central limit theorem?).
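A toy numeric illustration of that point (synthetic numbers, unrelated to the campaign data): points scattered uniformly over a range still have column means near the center of that range.

```python
import numpy as np

rng = np.random.default_rng(42)
# 'Noise' points drawn uniformly over [-3, 3] in 5 dimensions
noise = rng.uniform(-3, 3, size=(1000, 5))

# Despite the scatter, every column mean sits close to 0
print(np.abs(noise.mean(axis=0)).max())
```

With only 13 points, as in group -1, the means are noisier, but the same pull toward the center applies.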
for col in data_behavior.columns:
    sns.boxplot(x = 'DBSLabels', y = col, data = data)
    plt.show()
for col in data_demo.columns:
    sns.boxplot(x = 'DBSLabels', y = col, data = data_demo)
    plt.show()
mean = round(data_demo.groupby('DBSLabels').mean(), 2)
print(mean)
TotDependents Age FamilySize Income Marital_Status_Married \
DBSLabels
-1 0.46 43.54 2.15 54476.94 0.23
0 0.80 49.56 2.65 63230.07 0.39
1 1.24 45.11 3.11 35589.86 0.39
2 0.21 46.76 2.07 80079.98 0.39
Marital_Status_Single Marital_Status_Together \
DBSLabels
-1 0.23 0.23
0 0.20 0.26
1 0.22 0.26
2 0.24 0.22
Marital_Status_Widow Education_Basic Education_Graduation \
DBSLabels
-1 0.00 0.00 0.54
0 0.04 0.00 0.51
1 0.03 0.05 0.49
2 0.06 0.00 0.52
Education_Master Education_PhD HCLabels
DBSLabels
-1 0.00 0.38 0.77
0 0.16 0.25 1.00
1 0.17 0.19 3.00
2 0.17 0.24 0.00
Summary of each cluster: These clusters aren't as illustrative. Again, the highest-income group (group 2, only 228 customers) spends the most and has the smallest families. Group 2 responds to campaigns, but not to special deals.
Group 1 is the largest, with 1077 customers; they have the lowest incomes and bigger families, and they buy far less. They don't respond to campaigns, nor especially to deals.
Group 0 has 922 customers. They don't respond to campaigns, nor especially to deals. They are middle income and spend a corresponding amount.
The scattered group -1 is the most educated (PhD) and the youngest. The only things its 13 members seem to have in common are that they respond to deals and campaigns.
Three clusters has so far seemed the best, but let's do one last round of tuning based on the silhouette score to confirm this.
sc = {}
# Iterate over a range of component counts and store the silhouette score for each
for k in np.arange(2, 8):
    gmm = GaussianMixture(n_components = k, random_state = 1)
    gmm.fit(transformed_data)
    labels = gmm.predict(transformed_data)
    if len(np.unique(labels)) > 1:
        sc[k] = silhouette_score(transformed_data, labels)
# Elbow plot
plt.figure()
plt.plot(list(sc.keys()), list(sc.values()), 'bx-')
plt.xlabel("k")
plt.ylabel("Silhouette Score")
plt.show()
We confirm that 3 clusters is the best choice for GMM as well (silhouette = 0.56 again).
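Since a GMM is a probabilistic model, the component count can also be chosen with an information criterion such as BIC (lower is better; it penalizes extra components). A sketch on toy blobs rather than `transformed_data`:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Three well-separated toy Gaussian blobs
X = np.vstack([rng.normal(m, 0.3, size=(100, 2)) for m in (0, 3, 6)])

# Fit a GMM per candidate k and keep the k with the lowest BIC
bics = {k: GaussianMixture(n_components=k, random_state=1).fit(X).bic(X)
        for k in range(1, 7)}
best_k = min(bics, key=bics.get)
print(best_k)
```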
gmm = GaussianMixture(n_components = 3, random_state = 1)
gmm.fit(transformed_data)
data_behavior_scaled_copy['GmmLabels'] = gmm.predict(transformed_data)
data_behavior['GmmLabels'] = gmm.predict(transformed_data)
data['GmmLabels'] = gmm.predict(transformed_data)
data_demo['GmmLabels'] = gmm.predict(transformed_data)
transformed_data['GmmLabels'] = gmm.predict(transformed_data)
data.GmmLabels.value_counts()
1    1080
0     925
2     235
Name: GmmLabels, dtype: int64
Observations and Insights:
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='KMeans_Labels', palette='Dark2')
plt.subplot(1, 2, 2)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='GmmLabels', palette='Dark2')
plt.show()
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
sns.scatterplot(y='PC2', x='PC3', data=transformed_data, hue='KMeans_Labels', palette='Dark2')
plt.subplot(1, 2, 2)
sns.scatterplot(y='PC2', x='PC3', data=transformed_data, hue='GmmLabels', palette='Dark2')
plt.show()
Cluster 0 shrank a bit (n=925 here vs. n=1217 in KMeans, n=579 in KMedoids, n=927 in HC, and n=922 in DBSCAN), but otherwise the clusterings look very similar.
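Rather than eyeballing the counts, agreement between any two of the stored labelings can be quantified with the adjusted Rand index (1 means identical partitions, values near 0 mean chance-level agreement); label names are ignored. A sketch with toy label vectors:

```python
from sklearn.metrics import adjusted_rand_score

# Two hypothetical clusterings of the same 8 points
labels_a = [0, 0, 0, 1, 1, 2, 2, 2]
labels_b = [1, 1, 1, 0, 0, 2, 2, 2]  # same partition, names permuted

# ARI is invariant to label renaming, so this scores a perfect 1.0
print(adjusted_rand_score(labels_a, labels_b))
```

In this notebook, e.g. `adjusted_rand_score(data['KMeans_Labels'], data['GmmLabels'])` would quantify how closely the two partitions agree.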
mean = round(data_demo.groupby('GmmLabels').mean(), 2)
print(mean)
mean = round(data_behavior.groupby('GmmLabels').mean(), 2)
print(mean)
TotDependents Age FamilySize Income Marital_Status_Married \
GmmLabels
0 0.80 49.54 2.65 63233.64 0.39
1 1.24 45.10 3.11 35503.76 0.39
2 0.21 46.69 2.07 79828.37 0.39
Marital_Status_Single Marital_Status_Together \
GmmLabels
0 0.20 0.26
1 0.22 0.26
2 0.24 0.23
Marital_Status_Widow Education_Basic Education_Graduation \
GmmLabels
0 0.04 0.00 0.51
1 0.03 0.05 0.49
2 0.06 0.00 0.52
Education_Master Education_PhD HCLabels DBSLabels
GmmLabels
0 0.16 0.25 1.00 -0.00
1 0.17 0.19 3.00 0.99
2 0.16 0.24 0.01 1.91
MntWines MntFruits MntMeatProducts MntFishProducts \
GmmLabels
0 467.90 41.91 250.21 59.93
1 46.58 4.97 23.87 7.24
2 841.26 62.93 496.81 88.50
MntSweetProducts MntGoldProds AcceptedCmp4 AcceptedCmp5 \
GmmLabels
0 42.28 69.64 0.08 0.01
1 5.12 15.36 0.01 0.00
2 68.01 74.90 0.36 0.66
AcceptedCmp1 Response NumDealsPurchases NumWebPurchases \
GmmLabels
0 0.03 0.11 2.92 5.98
1 0.00 0.08 2.07 2.16
2 0.50 0.60 1.17 5.47
NumCatalogPurchases NumStorePurchases NumWebVisitsMonth \
GmmLabels
0 4.25 8.12 4.62
1 0.58 3.30 6.37
2 5.98 8.06 3.20
TotNumPurchased TotAccepted AvgPerPurchase TotSpent
GmmLabels
0 18.35 0.29 50.88 931.88
1 6.04 0.16 14.93 103.13
2 19.51 2.35 87.33 1632.41
Summary of each cluster:
Observations and Insights: Very similar stories to what we've seen before. Group 2 (n=235) here is almost exactly the same group as DBSCAN's group 2: highest income, smallest families, highest spenders, more store and catalog purchases, fewer website visits. Not susceptible to deals, but readily responsive to marketing campaigns.
Group 1 (n=1080) is again the web window shoppers: lowest income, least educated, largest families, smallest spenders, who spend a lot of time on the website without buying very much. They are a bit more likely to respond to deals but not to campaigns. This is also group 1 from DBSCAN.
Group 0 (n=925): intermediate income, family size, and spending; highly educated; they don't respond to marketing campaigns but are the most responsive to deals.
1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):
2. Refined insights:
3. Proposal for the final solution design:
!pip install nbconvert
# Export this notebook to HTML using nbconvert
import os
from nbconvert import HTMLExporter
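With `HTMLExporter` imported, the notebook can be rendered to an HTML string. A minimal sketch is below; it builds a tiny notebook node in memory to stand in for the real file (in practice you would load the actual `.ipynb` with `nbformat.read(path, as_version=4)` — the cell content here is purely illustrative).

```python
import nbformat
from nbconvert import HTMLExporter

# Build a tiny notebook in memory; this stands in for reading the real
# .ipynb file with nbformat.read(path, as_version=4)
nb = nbformat.v4.new_notebook()
nb.cells.append(nbformat.v4.new_code_cell("print('hello')"))

# Convert the notebook node to an HTML document string
exporter = HTMLExporter()
body, resources = exporter.from_notebook_node(nb)

# Write the rendered HTML to disk
with open("notebook_export.html", "w", encoding="utf-8") as f:
    f.write(body)
```

`from_notebook_node` returns the HTML body plus a `resources` dict (extracted images, CSS, etc.); only the body is needed for a self-contained export.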